End-to-End LLM Model Development with Torchtitan and Torchtune #341

KeitaW · 2024-05-20T12:18:20Z

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

KeitaW · 2024-06-11T00:48:26Z

Basic functionalities have been implemented. Allow me to iterate on the other PRs...

pbelevich · 2024-06-11T17:31:43Z

3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md

+* Evaluation
+* Deployment
+
+for details of each step, refer the [overview documentation](../../README.md).


Suggested change

-for details of each step, refer the [overview documentation](../../README.md).
+for details of each step, refer to the [overview documentation](../../README.md).

pbelevich · 2024-06-11T18:35:21Z

3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md

+In this step, you will fine-tune the Llama3 model starting from the original checkpoint using the WikiText dataset. This process, known as Full-Parameter Finetuning, updates all the parameters in the original model. The configuration file used for this process is `./tutorials/e2e-llama3-70b-development/full_finetune_distributed.yaml`.
+
+### Memory Consumption Challenges
+One of the primary challenges during such training is memory consumption. A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory (6 bytes for parameters in mixed precision training, 8 bytes for AdamW, and 4 bytes for other overheads). For more details on the anatomy, see the [Hugging Face blog post](https://huggingface.co/docs/transformers/model_memory_anatomy) blog post. This means that training a 70B parameter model would require more than 1.12 TB of accelerated memory, which far exceeds the 80 GB capacity of H100 accelerated memory. To address this issue, torchtune integrates PyTorch Fully Sharded Data Parallel (FSDP).


Suggested change

-One of the primary challenges during such training is memory consumption. A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory (6 bytes for parameters in mixed precision training, 8 bytes for AdamW, and 4 bytes for other overheads). For more details on the anatomy, see the [Hugging Face blog post](https://huggingface.co/docs/transformers/model_memory_anatomy) blog post. This means that training a 70B parameter model would require more than 1.12 TB of accelerated memory, which far exceeds the 80 GB capacity of H100 accelerated memory. To address this issue, torchtune integrates PyTorch Fully Sharded Data Parallel (FSDP).
+One of the primary challenges during such training is memory consumption. A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter (6 bytes for the parameter in mixed precision training, 4 bytes for the gradient, and 8 bytes for AdamW optimizer states) plus activation memory. For more details on the anatomy, see the [Hugging Face blog post](https://huggingface.co/docs/transformers/model_memory_anatomy). This means that training a 70B parameter model would require more than 1.12 TiB of accelerator's memory, which far exceeds the 80 GB capacity of H100 memory. To address this issue, torchtune integrates PyTorch Fully Sharded Data Parallel (FSDP).

pbelevich · 2024-06-11T18:37:23Z

3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md

+In this step, you will fine-tune the Llama3 model starting from the original checkpoint using the WikiText dataset. This process, known as Full-Parameter Finetuning, updates all the parameters in the original model. The configuration file used for this process is `./tutorials/e2e-llama3-70b-development/full_finetune_distributed.yaml`.
+
+### Memory Consumption Challenges
+One of the primary challenges during such training is memory consumption. A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory (6 bytes for parameters in mixed precision training, 8 bytes for AdamW, and 4 bytes for other overheads). For more details on the anatomy, see the [Hugging Face blog post](https://huggingface.co/docs/transformers/model_memory_anatomy) blog post. This means that training a 70B parameter model would require more than 1.12 TB of accelerated memory, which far exceeds the 80 GB capacity of H100 accelerated memory. To address this issue, torchtune integrates PyTorch Fully Sharded Data Parallel (FSDP).


How was 1.12 TiB calculated?
70_000_000_000 * 18 = 1_260_000_000_000 bytes
1_260_000_000_000 / 1024 / 1024 / 1024 / 1024 ≈ 1.15 TiB
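
The arithmetic in this comment can be reproduced with a short script. This is a minimal sketch that assumes the 18 bytes per parameter figure (6 + 4 + 8) from the suggested change and exactly 70 billion parameters; the variable names are illustrative only. It ignores activation memory.

```python
# Assumption: 18 bytes per parameter = 6 (weights) + 4 (gradients) + 8 (AdamW states)
BYTES_PER_PARAM = 18
NUM_PARAMS = 70_000_000_000  # 70B parameters

total_bytes = NUM_PARAMS * BYTES_PER_PARAM
total_tb = total_bytes / 1000**4   # decimal terabytes (TB)
total_tib = total_bytes / 1024**4  # binary tebibytes (TiB)

print(f"{total_bytes:_} bytes = {total_tb:.2f} TB = {total_tib:.2f} TiB")
```

Under these assumptions the total is about 1.26 TB or 1.15 TiB, so neither unit reproduces the 1.12 figure in the README text.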

pbelevich · 2024-06-11T18:43:21Z

3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md

+In this step, you will fine-tune the Llama3 model starting from the original checkpoint using the WikiText dataset. This process, known as Full-Parameter Finetuning, updates all the parameters in the original model. The configuration file used for this process is `./tutorials/e2e-llama3-70b-development/full_finetune_distributed.yaml`.
+
+### Memory Consumption Challenges
+One of the primary challenges during such training is memory consumption. A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory (6 bytes for parameters in mixed precision training, 8 bytes for AdamW, and 4 bytes for other overheads). For more details on the anatomy, see the [Hugging Face blog post](https://huggingface.co/docs/transformers/model_memory_anatomy) blog post. This means that training a 70B parameter model would require more than 1.12 TB of accelerated memory, which far exceeds the 80 GB capacity of H100 accelerated memory. To address this issue, torchtune integrates PyTorch Fully Sharded Data Parallel (FSDP).


Memory itself is not accelerated.

pbelevich · 2024-06-11T18:44:47Z

3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md

+
+### Basic concepts and relevant configuration
+
+**FSDP** is a distributed training feature designed to efficiently handle large model training by sharding model parameters, gradients, and optimizer states across multiple devices. This approach significantly reduces memory consumption and optimizes resource utilization, making it possible to train models that are too large to fit on a single GPU. In `torchtune` users can launch FSDP training job with command `tune run full_finetune_distributed`.  


Suggested change

-**FSDP** is a distributed training feature designed to efficiently handle large model training by sharding model parameters, gradients, and optimizer states across multiple devices. This approach significantly reduces memory consumption and optimizes resource utilization, making it possible to train models that are too large to fit on a single GPU. In `torchtune` users can launch FSDP training job with command `tune run full_finetune_distributed`.
+**FSDP** is a distributed training technique designed to efficiently handle large model training by sharding model parameters, gradients, and optimizer states across multiple devices. This approach significantly reduces memory consumption and optimizes resource utilization, making it possible to train models that are too large to fit on a single GPU. In `torchtune` users can launch FSDP training job with command `tune run full_finetune_distributed`.
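
To make the effect of sharding concrete, here is a rough back-of-the-envelope sketch. It assumes the 18 bytes per parameter figure discussed in this review and that parameters, gradients, and optimizer states are split perfectly evenly across GPUs. It ignores activation memory, communication buffers, and fragmentation, so real usage will be higher. `sharded_state_gb` is a hypothetical helper, not a torchtune or PyTorch API, and the GPU counts are illustrative (8 per node matches `--nproc_per_node=8` in the sbatch files).

```python
def sharded_state_gb(num_params, num_gpus, bytes_per_param=18):
    """Per-GPU model-state memory in GB, assuming an even split across GPUs."""
    return num_params * bytes_per_param / num_gpus / 1e9


# Compare against the 80 GB capacity of a single H100.
for num_gpus in (1, 8, 16, 32):
    per_gpu = sharded_state_gb(70e9, num_gpus)
    fits = "fits" if per_gpu < 80 else "does not fit"
    print(f"{num_gpus:>2} GPUs: {per_gpu:8.2f} GB per GPU ({fits} in 80 GB)")
```

Under these simplifying assumptions, the model states of a 70B model only drop below 80 GB per GPU at around 16 GPUs, and that is before any activation memory is counted.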

pbelevich · 2024-06-11T18:49:52Z

..._cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/lora_finetune_distributed.sbatch

+    --master_port $RANDOM 
+    --nproc_per_node=8 
+    --nnodes $NNODES 
+    --nnodes=$SLURM_JOB_NUM_NODES 


`--nnodes` is specified twice.

pbelevich · 2024-06-11T18:50:10Z

3.test_cases/torchtitan/pretrain.sbatch

+    --master_port $RANDOM 
+    --nproc_per_node=8 
+    --nnodes $NNODES 
+    --nnodes=$SLURM_JOB_NUM_NODES 


`--nnodes` is specified twice.
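
A repeated launcher option like the `--nnodes` duplication flagged in both sbatch files can be caught with a small check before submitting the job. The sketch below is only an illustration: `repeated_flags` is a hypothetical helper, not part of torchrun or Slurm, and the argument list is copied by hand from the diff above.

```python
from collections import Counter


def repeated_flags(args):
    """Return the long options that appear more than once in an argument list."""
    # Normalize "--flag=value" and "--flag value" to the bare flag name.
    names = [arg.split("=", 1)[0] for arg in args if arg.startswith("--")]
    return sorted(name for name, count in Counter(names).items() if count > 1)


launcher_args = [
    "--master_port", "$RANDOM",
    "--nproc_per_node=8",
    "--nnodes", "$NNODES",
    "--nnodes=$SLURM_JOB_NUM_NODES",
]
print(repeated_flags(launcher_args))
```

For the arguments in this diff, the helper reports `--nnodes` as the only repeated option; keeping a single `--nnodes=$SLURM_JOB_NUM_NODES` would remove the ambiguity.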

pbelevich · 2024-06-11T18:52:35Z

3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md

+sbatch tutorials/e2e-llama3-70b-development/full_finetune_distributed.sbatch
+```
+
+By default, this script launches the FSDP training job with two instances. Once the job has been scheduled, you will see the following outputs in the log file named `logs/full-finetuning*`:


I don't see where two instances are specified by default. I only see `--nnodes 1 --nnodes=1` in the sbatch files.

KeitaW and others added 30 commits March 17, 2024 08:28

Update README.md

0e7a427

Update README

d6e56b2

update scripts

5c9ecae

update log file name

b30d471

Remove 14k log lines

5d160c6

Merge pull request #216 from aws-samples/reduce-build-log-efa-exporter

736a029

SMHP: Remove 14k log lines from efa exporter LCC

Merge pull request #215 from aws-samples/pytorch-cpu-ddp-conda-enroot

ae6b020

Add conda and docker environment setups for 16.pytorch-capu-ddp test case.

Merge pull request #214 from aws-samples/smph-fix-dcgm-exporter-gpu-util

61ddfa5

Bump dcgm exporter version to correctly capture GPU utilization

Change nccl version to 2.20.3

098c222

NCCL 2.19.4 has performance regression.

Merge pull request #217 from aws-samples/nccl_tests_version_changes

0e77ca5

Change nccl version to 2.20.3

smp v2 llama2 training example using fp8

1a88359

Update 3.container-train.sbatch

318f9d9

Merge pull request #221 from aws-samples/KeitaW-patch-1

3dc358c

Update 3.container-train.sbatch

Added second subnet for other AWS services which require multi-AZ

bea1b68

Removed FSXSecurityGroup as it is unused

da7a51d

Renamed resources to Primary/Backup Subnet

1de3e5a

Revert "Removed FSXSecurityGroup as it is unused"

018f4e9

This reverts commit da7a51d.

Merge branch 'hyperpod_backup_subnet_20240326'

a474f65

Rename 0.crate-conda-env.sh to 0.create-conda-env.sh

447d45c

Typo in the name.

Merge pull request #225 from aws-samples/sean-smith-patch-2

c6a146b

Rename 0.crate-conda-env.sh to 0.create-conda-env.sh

Deleted unused security group FSXSecurityGroup

67d6af7

Merge pull request #222 from shimomut/main

8d59eef

Updating CF template for HyperPod to support second private subnet

Added comments to conda setup scripts

ac8f5bd

Merge pull request #218 from aruncs2005/main

44701fd

smp v2 llama2 training example using fp8

Update 1.conda-train.sbatch

73b2ccb

Update 3.container-train.sbatch

4aa19c5

Merge pull request #229 from aws-samples/KeitaW-patch-1

437783a

Update 1.conda-train.sbatch

updated pytorch version to 2.2

c608899

Validate Json in preflight check

3adaa5c

Signed-off-by: Sean Smith <[email protected]>

Merge pull request #233 from aws-samples/validate-json

7d25c4a

Validate Json in preflight check

KeitaW added 4 commits May 31, 2024 12:51

update

d4029d2

update

ae98bf9

update

64e0724

update

00dfbf5

KeitaW force-pushed the torchtitan-torchtune branch from 3ae455a to 64e0724 Compare June 3, 2024 22:53

KeitaW force-pushed the main branch from 8dc7dc0 to 44e448e Compare June 3, 2024 22:53

Merge branch 'torchtitan-torchtune' of github.com:aws-samples/awsome-…

b929043

…distributed-training into torchtitan-torchtune

KeitaW force-pushed the main branch from 44e448e to 1209815 Compare June 4, 2024 02:26

KeitaW force-pushed the torchtitan-torchtune branch from 64e0724 to 00dfbf5 Compare June 4, 2024 02:26

KeitaW force-pushed the main branch 2 times, most recently from 44e448e to 1209815 Compare June 4, 2024 02:30

KeitaW added 2 commits June 4, 2024 02:31

Merge branch 'torchtitan-torchtune' of github.com:aws-samples/awsome-…

4ac5496

…distributed-training into torchtitan-torchtune

update

952eba3

KeitaW force-pushed the torchtitan-torchtune branch from 436b58c to 952eba3 Compare June 4, 2024 03:32

update

563e807

KeitaW requested a review from pbelevich June 5, 2024 02:18

KeitaW marked this pull request as ready for review June 5, 2024 02:18

Merge branch 'main' into torchtitan-torchtune

71c33f6

pbelevich requested changes Jun 11, 2024

3.test_cases/torchtune/slurm/README.md (outdated)

3.test_cases/torchtune/slurm/README.md (outdated)

pbelevich reviewed Jun 11, 2024

3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md (outdated)

KeitaW and others added 3 commits June 11, 2024 14:23

Update 3.test_cases/torchtune/slurm/README.md

77d4908

Co-authored-by: Pavel Belevich <[email protected]>

Update 3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-developm…

0133094

…ent/README.md Co-authored-by: Pavel Belevich <[email protected]>

Update 3.test_cases/torchtune/slurm/README.md

f8833b7

Co-authored-by: Pavel Belevich <[email protected]>

pbelevich requested changes Jun 11, 2024

pbelevich reviewed Jun 11, 2024
